0.1 Executive Summary

We have set out to predict the total cost for two people to stay in an Airbnb in the city of Milan for 4 nights. To select suitable properties we have ensured that all have private rooms, a rating of 4.5 or greater and at least 10 reviews.

To carry out the analysis, we conducted a thorough exploration of the data to gain an understanding of the relevant variables. During our EDA we found that a number of variables affect the price of a room in Milan. The number and type of services provided with the rental raise the price of the stay, which is likely due to the costs associated with these services. The type of property has a large impact on the price of the stay, with hotel rooms and entire lofts commanding the largest premiums. Neighbourhood also plays a large role, allowing hosts to command higher prices due to their location: Tre Torri, an affluent modern neighbourhood, commands the highest average price per room.

By building 8 models we were able to demonstrate and understand how a number of different variables affect the price of our desired stay. To do this we chose between room type and a simplified version of property type. We then considered 4 size-related variables: bathrooms, bedrooms, beds and accommodates (the number of people the property can host). From these we selected bedrooms for our regression analysis. The best-fitting model was model 8, with an R-squared of 0.304, the highest of any model we fitted. Using this model we found c. 1,100 properties in Milan that are suitable for 2 people staying 4 nights, and we illustrate the distribution of prices across these suitable properties.

The following report walks you through our process, exploration, analysis and outputs.

1 Exploratory Data Analysis (EDA)

Exploratory Data Analysis for Airbnb properties in Milan

Let’s look at the raw data

glimpse(listings)
Rows: 17,703
Columns: 74
$ id                                           <dbl> 6400, 23986, 28300, 37256~
$ listing_url                                  <chr> "https://www.airbnb.com/r~
$ scrape_id                                    <dbl> 2.021092e+13, 2.021092e+1~
$ last_scraped                                 <date> 2021-09-20, 2021-09-20, ~
$ name                                         <chr> "The Studio Milan", "\" C~
$ description                                  <chr> "Enjoy your stay at The S~
$ neighborhood_overview                        <chr> "The neighborhood is quie~
$ picture_url                                  <chr> "https://a0.muscache.com/~
$ host_id                                      <dbl> 13822, 95941, 121663, 119~
$ host_url                                     <chr> "https://www.airbnb.com/u~
$ host_name                                    <chr> "Francesca", "Jeremy", "M~
$ host_since                                   <date> 2009-04-17, 2010-03-19, ~
$ host_location                                <chr> "Milan, Lombardia, Italy"~
$ host_about                                   <chr> "I'm am Francesca Sottila~
$ host_response_time                           <chr> "N/A", "N/A", "N/A", "N/A~
$ host_response_rate                           <chr> "N/A", "N/A", "N/A", "N/A~
$ host_acceptance_rate                         <chr> "N/A", "N/A", "N/A", "N/A~
$ host_is_superhost                            <lgl> FALSE, FALSE, FALSE, TRUE~
$ host_thumbnail_url                           <chr> "https://a0.muscache.com/~
$ host_picture_url                             <chr> "https://a0.muscache.com/~
$ host_neighbourhood                           <chr> "Zona 5", "Navigli", "Cen~
$ host_listings_count                          <dbl> 1, 1, 1, 2, 2, 2, 4, 1, 0~
$ host_total_listings_count                    <dbl> 1, 1, 1, 2, 2, 2, 4, 1, 0~
$ host_verifications                           <chr> "['email', 'phone', 'revi~
$ host_has_profile_pic                         <lgl> TRUE, TRUE, TRUE, TRUE, T~
$ host_identity_verified                       <lgl> FALSE, TRUE, TRUE, TRUE, ~
$ neighbourhood                                <chr> "Milan, Lombardy, Italy",~
$ neighbourhood_cleansed                       <chr> "TIBALDI", "NAVIGLI", "SA~
$ neighbourhood_group_cleansed                 <lgl> NA, NA, NA, NA, NA, NA, N~
$ latitude                                     <dbl> 45.44195, 45.44991, 45.47~
$ longitude                                    <dbl> 9.17797, 9.17597, 9.17359~
$ property_type                                <chr> "Private room in rental u~
$ room_type                                    <chr> "Private room", "Entire h~
$ accommodates                                 <dbl> 1, 4, 2, 1, 4, 4, 5, 3, 2~
$ bathrooms                                    <lgl> NA, NA, NA, NA, NA, NA, N~
$ bathrooms_text                               <chr> "3.5 baths", "1 bath", "1~
$ bedrooms                                     <dbl> 3, 1, 1, 1, 2, 2, 2, 2, 1~
$ beds                                         <dbl> 1, 1, 2, 1, 4, 2, 3, 1, 1~
$ amenities                                    <chr> "[\"Hangers\", \"Iron\", ~
$ price                                        <chr> "$100.00", "$150.00", "$1~
$ minimum_nights                               <dbl> 4, 1, 1, 2, 3, 2, 2, 3, 2~
$ maximum_nights                               <dbl> 5, 730, 14, 730, 90, 30, ~
$ minimum_minimum_nights                       <dbl> 4, 1, 1, 2, 3, 2, 2, 3, 2~
$ maximum_minimum_nights                       <dbl> 4, 1, 1, 2, 3, 2, 2, 3, 2~
$ minimum_maximum_nights                       <dbl> 5, 730, 14, 1125, 90, 30,~
$ maximum_maximum_nights                       <dbl> 5, 730, 14, 1125, 90, 30,~
$ minimum_nights_avg_ntm                       <dbl> 4, 1, 1, 2, 3, 2, 2, 3, 2~
$ maximum_nights_avg_ntm                       <dbl> 5, 730, 14, 1125, 90, 30,~
$ calendar_updated                             <lgl> NA, NA, NA, NA, NA, NA, N~
$ has_availability                             <lgl> TRUE, TRUE, TRUE, TRUE, T~
$ availability_30                              <dbl> 23, 28, 30, 0, 0, 23, 0, ~
$ availability_60                              <dbl> 53, 58, 60, 0, 0, 53, 0, ~
$ availability_90                              <dbl> 83, 88, 90, 0, 0, 83, 0, ~
$ availability_365                             <dbl> 358, 363, 365, 0, 203, 35~
$ calendar_last_scraped                        <date> 2021-09-20, 2021-09-20, ~
$ number_of_reviews                            <dbl> 12, 15, 8, 34, 37, 14, 27~
$ number_of_reviews_ltm                        <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0~
$ number_of_reviews_l30d                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ first_review                                 <date> 2014-04-11, 2015-09-21, ~
$ last_review                                  <date> 2010-04-19, 2020-09-07, ~
$ review_scores_rating                         <dbl> 4.89, 4.64, 4.71, 4.90, 4~
$ review_scores_accuracy                       <dbl> 5.00, 4.53, 4.71, 4.79, 4~
$ review_scores_cleanliness                    <dbl> 5.00, 4.40, 4.86, 4.90, 4~
$ review_scores_checkin                        <dbl> 5.00, 4.40, 4.86, 5.00, 5~
$ review_scores_communication                  <dbl> 5.00, 4.53, 4.86, 5.00, 4~
$ review_scores_location                       <dbl> 4.56, 4.53, 5.00, 5.00, 4~
$ review_scores_value                          <dbl> 4.67, 4.53, 5.00, 4.59, 4~
$ license                                      <chr> NA, NA, NA, NA, NA, NA, N~
$ instant_bookable                             <lgl> FALSE, FALSE, FALSE, TRUE~
$ calculated_host_listings_count               <dbl> 1, 1, 1, 2, 2, 1, 1, 1, 1~
$ calculated_host_listings_count_entire_homes  <dbl> 0, 1, 0, 1, 2, 1, 1, 1, 1~
$ calculated_host_listings_count_private_rooms <dbl> 1, 0, 1, 1, 0, 0, 0, 0, 0~
$ calculated_host_listings_count_shared_rooms  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ reviews_per_month                            <dbl> 0.13, 0.21, 0.11, 0.47, 0~

1.1 Let’s take a closer look.

1.1.1 First, we need to convert the variable price into a number so that we can visualize the data. We did not use “skim” because it reported a lot of information we deemed unnecessary.

listings <- listings %>% 
  mutate(price = parse_number(as.character(price)))
favstats(price ~ bedrooms, data=listings)

bedrooms      min        Q1    median        Q3       max      mean        sd      n  missing
       1        9        50        75       114   1.2e+04       107       247  12984        0
       2       12        80       120       195     1e+04       177       370   2719        0
       3       16       116       190       300     1e+04       301       713    463        0
       4       17       146       247       400   2.5e+03       335       343    113        0
       5       23       115       300       482   1.2e+03       333       282     24        0
       6       90       191       292       392       493       292       285      2        0
       7      394  1.17e+03  1.95e+03  2.72e+03   3.5e+03  1.95e+03   2.2e+03      2        0
       8      732       732       732       732       732       732        NA      1        0
      16  8.5e+03   8.5e+03   8.5e+03   8.5e+03   8.5e+03   8.5e+03        NA      1        0

Data description

There are 74 variables and 17,703 observations within the Airbnb dataset.

The following variables are numeric.

#Returning indicator names with type dbl
listings %>%
  select(where(is.numeric))%>%
  colnames()
 [1] "id"                                          
 [2] "scrape_id"                                   
 [3] "host_id"                                     
 [4] "host_listings_count"                         
 [5] "host_total_listings_count"                   
 [6] "latitude"                                    
 [7] "longitude"                                   
 [8] "accommodates"                                
 [9] "bedrooms"                                    
[10] "beds"                                        
[11] "price"                                       
[12] "minimum_nights"                              
[13] "maximum_nights"                              
[14] "minimum_minimum_nights"                      
[15] "maximum_minimum_nights"                      
[16] "minimum_maximum_nights"                      
[17] "maximum_maximum_nights"                      
[18] "minimum_nights_avg_ntm"                      
[19] "maximum_nights_avg_ntm"                      
[20] "availability_30"                             
[21] "availability_60"                             
[22] "availability_90"                             
[23] "availability_365"                            
[24] "number_of_reviews"                           
[25] "number_of_reviews_ltm"                       
[26] "number_of_reviews_l30d"                      
[27] "review_scores_rating"                        
[28] "review_scores_accuracy"                      
[29] "review_scores_cleanliness"                   
[30] "review_scores_checkin"                       
[31] "review_scores_communication"                 
[32] "review_scores_location"                      
[33] "review_scores_value"                         
[34] "calculated_host_listings_count"              
[35] "calculated_host_listings_count_entire_homes" 
[36] "calculated_host_listings_count_private_rooms"
[37] "calculated_host_listings_count_shared_rooms" 
[38] "reviews_per_month"                           

The following variables are stored as character (text); several of them, such as property_type and room_type, are categorical.

#Returning indicator names with type character
listings %>%
  select(where(is.character))%>%
  colnames()
 [1] "listing_url"            "name"                   "description"           
 [4] "neighborhood_overview"  "picture_url"            "host_url"              
 [7] "host_name"              "host_location"          "host_about"            
[10] "host_response_time"     "host_response_rate"     "host_acceptance_rate"  
[13] "host_thumbnail_url"     "host_picture_url"       "host_neighbourhood"    
[16] "host_verifications"     "neighbourhood"          "neighbourhood_cleansed"
[19] "property_type"          "room_type"              "bathrooms_text"        
[22] "amenities"              "license"               

Let’s understand the data better with the use of some graphs

Here we have a bar chart of the number of bedrooms. We remove very large values, which in our case means properties with more than 5 bedrooms.

listings %>% 
  filter(bedrooms<=5) %>% 
  ggplot(aes(x=bedrooms))+
  geom_bar()+
  labs(title="Number of Airbnb properties in Milan grouped by bedrooms", x="Bedrooms",y="Number of properties")+
  NULL

Here we have a histogram of the average review score for properties in Milan. As the graph shows, the vast majority of properties have ratings above 4.

listings %>% 
  ggplot(aes(x=review_scores_rating))+
  geom_histogram()+
  labs(title="Distribution of ratings per Airbnb property in Milan", x="Ratings",y="Number of properties")+
  NULL

Here we have a box plot of the number of reviews per Airbnb property. We filter the data and only analyze properties that have at least 100 reviews, so as to remove properties that have not been on the “market” long enough and hence have not been used much.

listings %>%
  filter(number_of_reviews>=100) %>% 
  ggplot(aes(x=number_of_reviews))+
  geom_boxplot()+
  labs(title="Boxplot of the number of reviews per Airbnb property in Milan", x="Number of Reviews")+
  NULL

Here we have a density plot of the price per night per Airbnb property. We filter the data and only analyze properties with a price per night of at most 300, so as to remove the outliers created by properties that can be considered “luxury”.

listings <- listings %>% 
  # Ensure price is numeric (it was already converted earlier, so this acts as a safeguard)
  mutate(price = parse_number(as.character(price))) %>% 
  # Split the city into four quadrants around longitude 9.17279 and latitude 45.462395
  mutate(neighbourhood_simplified = ifelse(longitude <= 9.17279 & latitude <= 45.462395, "Southwest", 
         ifelse(longitude <= 9.17279 & latitude > 45.462395, "Northwest",
         ifelse(longitude > 9.17279 & latitude <= 45.462395, "Southeast", "Northeast"))))


listings %>%
  filter(price<=300) %>% 
  ggplot(aes(x=price))+
  geom_density()+
  labs(title="Distribution of the price per night per Airbnb property in Milan", x="Price per night",y="Density")+
  NULL

Let’s look at the property types in more detail. Here are some numbers:

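# Share of all listings per property type (denominator taken from the data, not hard-coded)
proportion_listing <- listings %>%
  group_by(property_type) %>%
  count() %>%
  mutate(pct = scales::percent(n / nrow(listings)))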

proportion_listing %>%
  arrange(desc(n))
# A tibble: 52 x 3
# Groups:   property_type [52]
   property_type                           n pct  
   <chr>                               <int> <chr>
 1 Entire rental unit                  10178 57%  
 2 Private room in rental unit          2661 15%  
 3 Entire condominium (condo)           1830 10%  
 4 Entire loft                           833 5%   
 5 Private room in condominium (condo)   631 4%   
 6 Entire residential home               274 2%   
 7 Entire serviced apartment             189 1%   
 8 Shared room in rental unit            182 1%   
 9 Private room in residential home      168 1%   
10 Private room in bed and breakfast     137 1%   
# ... with 42 more rows

The 4 most common property types are ‘entire rental unit’, ‘private room in rental unit’, ‘entire condo’ and ‘entire loft’ (57%, 15%, 10% and 5% of listings respectively). Together these property types make up roughly 88% of the properties (15,502 of 17,703).
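The shares quoted above can be checked directly from the counts in the table. The sketch below hard-codes the four counts and the 17,703 total from the output; the object names (`top4`, `total_listings`) are only illustrative.

```r
# Counts of the four most common property types, copied from the table above
top4 <- c("Entire rental unit"          = 10178,
          "Private room in rental unit" = 2661,
          "Entire condominium (condo)"  = 1830,
          "Entire loft"                 = 833)
total_listings <- 17703  # number of rows in the dataset

# Individual shares, in percent
individual_share <- round(100 * top4 / total_listings, 1)
individual_share

# Combined share of the four types, in percent
combined_share <- round(100 * sum(top4) / total_listings, 1)
combined_share
```

Summing the rounded percentages (57 + 15 + 10 + 5) slightly understates the combined share, which is why the calculation works from the raw counts.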

Since the vast majority of the observations in the data are one of the top four or five property types, we have chosen to create a simplified version of property_type variable that has 5 categories: the top four categories and Other.

listings <- listings %>%
  mutate(prop_type_simplified = case_when(
    property_type %in% c("Entire rental unit","Private room in rental unit", "Entire condominium (condo)","Entire loft") ~ property_type, 
    TRUE ~ "Other"
  ))
listings %>%
  count(property_type, prop_type_simplified) %>%
  arrange(desc(n)) 
property_type                        prop_type_simplified             n
Entire rental unit                   Entire rental unit           10178
Private room in rental unit          Private room in rental unit   2661
Entire condominium (condo)           Entire condominium (condo)    1830
Entire loft                          Entire loft                    833
Private room in condominium (condo)  Other                          631
Entire residential home              Other                          274
Entire serviced apartment            Other                          189
Shared room in rental unit           Other                          182
Private room in residential home     Other                          168
Private room in bed and breakfast    Other                          137
Private room in loft                 Other                           91
Room in boutique hotel               Other                           56
Room in hotel                        Other                           56
Private room in serviced apartment   Other                           36
Shared room in condominium (condo)   Other                           33
Private room in villa                Other                           30
Tiny house                           Other                           26
Room in aparthotel                   Other                           24
Entire guest suite                   Other                           22
Private room in guest suite          Other                           21
Private room in townhouse            Other                           20
Entire villa                         Other                           19
Room in bed and breakfast            Other                           19
Entire townhouse                     Other                           18
Private room in hostel               Other                           17
Room in serviced apartment           Other                           17
Shared room in hostel                Other                           17
Room in hostel                       Other                           12
Private room                         Other                           10
Entire place                         Other                            9
Shared room in loft                  Other                            9
Private room in tiny house           Other                            8
Shared room in residential home      Other                            8
Private room in casa particular      Other                            5
Shared room in bed and breakfast     Other                            5
Private room in guesthouse           Other                            4
Casa particular                      Other                            3
Entire bed and breakfast             Other                            3
Entire guesthouse                    Other                            3
Entire home/apt                      Other                            3
Shared room in tiny house            Other                            3
Camper/RV                            Other                            2
Dome house                           Other                            2
Boat                                 Other                            1
Cave                                 Other                            1
Earth house                          Other                            1
Island                               Other                            1
Private room in camper/rv            Other                            1
Private room in cave                 Other                            1
Private room in earth house          Other                            1
Private room in farm stay            Other                            1
Tipi                                 Other                            1
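The simplification above lists the four category names by hand. As a hedged alternative, the top categories could be derived from the counts themselves, so the grouping adapts if the data changes. This is a minimal base R sketch on a toy vector (the values and the choice `k <- 2` are hypothetical, not the actual listings data).

```r
# Toy stand-in for listings$property_type (hypothetical values)
property_type <- c(rep("Entire rental unit", 6),
                   rep("Private room in rental unit", 3),
                   rep("Entire loft", 2),
                   "Boat", "Tipi")

# Names of the k most frequent categories
k <- 2
top_k <- names(sort(table(property_type), decreasing = TRUE))[seq_len(k)]

# Keep the top categories and lump everything else into "Other"
prop_type_simplified <- ifelse(property_type %in% top_k, property_type, "Other")
table(prop_type_simplified)
```

With the real data, `k <- 4` would be expected to reproduce the same five groups as the hand-written version.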

1.2 Last but not least, let’s look at the correlation between the variables in our dataset. It is important to do so before the regression to understand whether there is collinearity among predictors.

We will now look at the correlation between selected variables in the dataset.

listings %>% #Correlation between availability and price
  select(where(is.numeric)) %>% 
  select(price, availability_30,availability_60,availability_90,availability_365) %>% 
  ggpairs(aes(alpha=0.2))+
  theme_bw()

As the graph shows, the correlation between availability and price is low. This suggests that the availability of a property has little linear relationship with its price.

listings %>% #Correlation between review and price 
  select(price, bedrooms,beds,review_scores_rating,review_scores_accuracy, review_scores_cleanliness,review_scores_checkin,
         review_scores_communication,review_scores_location,review_scores_value ) %>% 
  ggpairs(aes(alpha=0.2))+
  theme_bw()

As the graph shows, the correlation between ratings and price is low. Lower-priced rooms can still receive high ratings, which may indicate that customers care about value for money. There is, however, a significant correlation between the number of beds and price.

listings %>%
  group_by(prop_type_simplified) %>%
  summarise(avg_price = mean(price)) %>%
  ggplot(aes(x = prop_type_simplified, y = avg_price)) +
  geom_col() +
  labs(title = "Average Property Price of Different Property Types", 
       x = "Property Type",
       y = "Average Price Per Night") 

The bar chart above shows that entire lofts have the highest average price among all the property types, while private rooms in rental units rank the lowest. This makes sense, since lofts tend to have more modern furnishings than traditional buildings, especially in historic European cities like Milan. Lofts are also more spacious than other property types, based on the personal experience of Francesco (our Italian group member). In addition, guests in a private room need to share the living room with other tenants, which reduces their comfort.

listings %>%
  group_by(room_type) %>%
  summarise(avg_price = mean(price)) %>%
  ggplot(aes(x = room_type, y = avg_price)) +
  geom_col() +
  labs(title = "Average Property Price of Different Room Types", 
       x = "Room Type",
       y = "Average Price Per Night") 

The bar chart above shows that hotel rooms have a much higher average price than any other room type, since customers pay a premium for cleaning, security, free breakfast etc. In comparison, shared rooms have the lowest average price of all types, since the space needs to be shared with someone else.

listings %>%
  group_by(neighbourhood_cleansed) %>%
  summarise(avg_price = mean(price)) %>%
  ggplot(aes(x = avg_price, y = neighbourhood_cleansed)) +
  geom_col() +
  # Price is on the x-axis and neighbourhood on the y-axis, so the labels follow that order
  labs(title = "Average Property Price of Different Neighbourhoods", 
       x = "Average Price Per Night",
       y = "Neighbourhood") 

Tre Torri has the highest average property price among all the neighbourhoods. Tre Torri is located around the three towers and can serve a substantial number of employees working in high-caliber companies. The facilities in this area are extremely modern, with only 14 years of history since groundbreaking, and are accompanied by a lot of parks for entertainment. Ronchetto delle Rane, on the other hand, has the lowest average property price, since it is located in a suburb of Milan with outdated facilities.

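correlation_matrix_data_1 <- listings %>% 
  select(price,bedrooms, accommodates)
# bedrooms appears to contain missing values (the favstats counts sum to fewer than 17,703),
# so each correlation is computed on the rows where both variables are present
corr <- round(cor(correlation_matrix_data_1, use = "pairwise.complete.obs"), 1)
ggcorrplot(corr)  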
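One detail worth illustrating for the correlation matrix: by default `cor()` returns NA for any pair of columns in which one column has a missing value, so the `use` argument matters when a variable such as bedrooms is incomplete. The sketch below uses a small made-up data frame (the numbers are hypothetical) to show the difference.

```r
# Toy data with a missing value in bedrooms (hypothetical numbers)
toy <- data.frame(
  price        = c(50, 80, 120, 200, 300),
  bedrooms     = c(1, 1, 2, NA, 4),
  accommodates = c(2, 2, 4, 5, 8)
)

# Default behaviour: correlations between bedrooms and the other columns are NA
cor_default <- cor(toy)

# Pairwise deletion: each correlation uses the rows where both columns are present
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")
round(cor_pairwise, 2)
```

Pairwise deletion keeps as many rows as possible for each pair; `use = "complete.obs"` would instead drop every row that has any missing value.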

1.3 Data wrangling

#Changing price from str to numeric data type 
listings <- listings %>% 
  mutate(price = parse_number(as.character(price)))
typeof(listings$price)
[1] "double"

We have confirmed that price is now formatted as a number.
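To make the conversion concrete, here is a minimal base R sketch of what turning price strings into numbers involves. The report itself uses `parse_number()`; the `gsub()` approach and the third example value (with a thousands separator) are assumptions added for illustration.

```r
# Example price strings in the format shown in the raw data
# (the last value, with a thousands separator, is a hypothetical example)
raw_price <- c("$100.00", "$150.00", "$1,200.00")

# Strip everything except digits and the decimal point, then convert
price_num <- as.numeric(gsub("[^0-9.]", "", raw_price))

price_num
typeof(price_num)
```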

Airbnb is most commonly used for travel purposes, i.e., as an alternative to traditional hotels. We only want to include listings in our regression analysis that are intended for travel purposes:

The most commonly reported minimum stays are 1, 2 and 3 nights, which together account for about 83% of listings.

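# Share of all listings per minimum_nights value (denominator taken from the data, not hard-coded)
nights_listing <- listings %>%
  group_by(minimum_nights) %>%
  count() %>%
  mutate(pct = scales::percent(n / nrow(listings)))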

nights_listing %>%
  arrange(desc(n))
# A tibble: 69 x 3
# Groups:   minimum_nights [69]
   minimum_nights     n pct  
            <dbl> <int> <chr>
 1              1  6853 39%  
 2              2  5548 31%  
 3              3  2246 13%  
 4              4   653 4%   
 5              5   571 3%   
 6              7   459 3%   
 7             30   291 2%   
 8              6   164 1%   
 9             15   145 1%   
10             29   122 1%   
# ... with 59 more rows

The minimum-night value that stands out is 30 days. A possible explanation is that these hosts prefer long-term lettings. Longer stays may also be in Airbnb's interest, since they increase the utilisation of the property and reduce business risk. Another stand-out value is a minimum of 7 nights, which is more common than a minimum of 6 nights; this encourages guests to stay an entire week and reduces hassle for the host.

We have filtered the data so that it shows the minimum nights as less than or equal to 4 nights.

listings_4nights <- listings %>%
  filter(minimum_nights <= 4)

#Check if we have derived the dataset that included minimum_nights <= 4 only
listings_4nights %>%
  group_by(minimum_nights) %>%
  count()
# A tibble: 4 x 2
# Groups:   minimum_nights [4]
  minimum_nights     n
           <dbl> <int>
1              1  6853
2              2  5548
3              3  2246
4              4   653
listings %>% 
  filter(minimum_nights <= 4) %>% 
  ggplot(aes(x=minimum_nights))+
  geom_bar()+
  labs(title="Number of properties in Milan grouped by minimum nights", 
       subtitle="We only consider properties that have 4 or fewer minimum nights", 
       x="Minimum nights",
       y="Number of properties")+
  NULL
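As a quick check on how much of the data survives the minimum-nights filter, the retained share can be computed from the counts printed above. The sketch hard-codes those counts; the object names are illustrative.

```r
# Listing counts by minimum_nights, copied from the output above
n_by_min_nights <- c(`1` = 6853, `2` = 5548, `3` = 2246, `4` = 653)
total_listings <- 17703

# Listings with minimum_nights <= 4, and their share of the full dataset
kept <- sum(n_by_min_nights)
kept_share <- round(100 * kept / total_listings, 1)

kept
kept_share
```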

2 Mapping

leaflet(data = filter(listings, minimum_nights <= 4)) %>% 
  addProviderTiles("OpenStreetMap.Mapnik") %>% 
  addCircleMarkers(lng = ~longitude, 
                   lat = ~latitude, 
                   radius = 1, 
                   fillColor = "blue", 
                   fillOpacity = 0.4, 
                   popup = ~listing_url,
                   label = ~property_type)

3 Regression Analysis

We have created a new variable called ‘price_4_nights’, using ‘price’ to calculate the total cost of a 4-night stay and ‘accommodates’ to keep only properties that can host two people.

listings_4_nights_2_people <- listings %>%
  filter(minimum_nights <= 4 , maximum_nights >= 4, accommodates >=2)
  
listings_4_nights_2_people <-  listings_4_nights_2_people %>% 
  mutate(price_4_nights = price*4)

We should use log-adjusted prices for the regression analysis, as the log-transformed variable exhibits an approximately normal distribution.

# Raw price for 4 nights; the x-axis is capped at 1000 so the bulk of the distribution is readable
ggplot(data=listings_4_nights_2_people, aes(x= price_4_nights)) +
  geom_histogram() +
  scale_x_continuous(limits=c(0,1000)) +
  labs(title = 'Price distribution for accommodation in Milan for 4 nights and 2 people', x = "Price", y = "Count") +
  theme_bw()

# Log of the price for 4 nights
ggplot(data=listings_4_nights_2_people, aes(x= log(price_4_nights))) +
  geom_histogram() +
  labs(title = 'Log adjusted price distribution for accommodation in Milan for 4 nights and 2 people', x = "Log price", y = "Count") +
  theme_bw()

Comment: We choose to use log(price_4_nights) in the regressions, since the distribution of price is approximately normal after taking the log. This makes the model more consistent with the typical assumptions of OLS analysis.

By contrast, the distribution of price_4_nights itself is right-skewed, which would distort the regression model: a few very expensive listings would have a disproportionate influence, and the coefficients would tend to be overstated.
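The effect of the log transform on skewness can be illustrated numerically. The sketch below does not use the listings data: it simulates right-skewed prices from a log-normal distribution (an assumption made purely for illustration) and compares the sample skewness before and after taking logs. The helper `skewness()` is defined here and is not from any package.

```r
set.seed(42)

# Simulated right-skewed prices (log-normal); not the actual listings data
price_sim <- rlnorm(5000, meanlog = 6, sdlog = 0.6)

# Sample skewness: mean of the cubed standardised values
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw <- skewness(price_sim)       # clearly positive: long right tail
skew_log <- skewness(log(price_sim))  # close to zero: roughly symmetric

c(raw = skew_raw, log = skew_log)
```

A skewness near zero after the transform is the numerical counterpart of the bell-shaped histogram of log(price_4_nights).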

We have created a regression model called model1 with the following explanatory variables: prop_type_simplified, number_of_reviews, and review_scores_rating.

log_listings_4_nights_2_people <-  listings_4_nights_2_people %>%  #Model 1 - Type of listing
  mutate(price_4_nights = log(price_4_nights))

model1 <- lm(price_4_nights ~ 
               prop_type_simplified + 
               number_of_reviews + 
               review_scores_rating, 
             data = log_listings_4_nights_2_people)

log_listings_4_nights_2_people %>%
  group_by(prop_type_simplified) %>%
  summarise(count=n())
prop_type_simplified         count
Entire condominium (condo)    1475
Entire loft                    717
Entire rental unit            8629
Other                         1518
Private room in rental unit   1722
autoplot(model1)+ theme_bw()

get_regression_table(model1) 
term                                               estimate  std_error  statistic  p_value  lower_ci  upper_ci
intercept                                             6.01       0.039     153       0         5.94      6.09
prop_type_simplified: Entire loft                     0.173      0.032       5.35    0         0.11      0.236
prop_type_simplified: Entire rental unit              0.094      0.021       4.47    0         0.053     0.135
prop_type_simplified: Other                          -0.093      0.027      -3.43    0.001    -0.146    -0.04
prop_type_simplified: Private room in rental unit    -0.389      0.026     -15       0        -0.44     -0.339
number_of_reviews                                    -0.001      0         -14.4     0        -0.001    -0.001
review_scores_rating                                 -0.029      0.007      -3.89    0        -0.043    -0.014
get_regression_summaries(model1)
r_squared  adj_r_squared    mse   rmse  sigma  statistic  p_value  df      nobs
    0.081           0.08  0.367  0.606  0.606        159        0   6  1.09e+04
mosaic::msummary(model1)
                                                  Estimate Std. Error t value
(Intercept)                                      6.013e+00  3.935e-02 152.820
prop_type_simplifiedEntire loft                  1.731e-01  3.234e-02   5.351
prop_type_simplifiedEntire rental unit           9.364e-02  2.095e-02   4.470
prop_type_simplifiedOther                       -9.258e-02  2.700e-02  -3.428
prop_type_simplifiedPrivate room in rental unit -3.895e-01  2.589e-02 -15.046
number_of_reviews                               -1.192e-03  8.293e-05 -14.376
review_scores_rating                            -2.868e-02  7.372e-03  -3.891
                                                Pr(>|t|)    
(Intercept)                                      < 2e-16 ***
prop_type_simplifiedEntire loft                 8.94e-08 ***
prop_type_simplifiedEntire rental unit          7.90e-06 ***
prop_type_simplifiedOther                       0.000609 ***
prop_type_simplifiedPrivate room in rental unit  < 2e-16 ***
number_of_reviews                                < 2e-16 ***
review_scores_rating                            0.000100 ***

Residual standard error: 0.606 on 10864 degrees of freedom
  (3190 observations deleted due to missingness)
Multiple R-squared:  0.08094,   Adjusted R-squared:  0.08043 
F-statistic: 159.5 on 6 and 10864 DF,  p-value: < 2.2e-16
car::vif(model1)  
                         GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 1.006827  4        1.000851
number_of_reviews    1.013385  1        1.006670
review_scores_rating 1.009113  1        1.004546

Comment: review_scores_rating is negatively associated with price, since its coefficient (and therefore its t-statistic) is negative. review_scores_rating is significant in predicting the price, as it has an absolute t-statistic of 3.891, which is greater than 2 (approximately the critical value at the 5% significance level).
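The significance check for review_scores_rating can be reproduced from the figures in the model1 summary. This is a sketch that hard-codes the reported estimate, standard error and residual degrees of freedom; because those inputs are rounded, the result matches the reported t-value only to about two decimals.

```r
# Estimate and standard error for review_scores_rating, from the model1 summary
estimate  <- -2.868e-02
std_error <-  7.372e-03
resid_df  <- 10864  # residual degrees of freedom reported for model1

t_stat  <- estimate / std_error                  # t = estimate / standard error
t_crit  <- qt(0.975, df = resid_df)              # two-sided 5% critical value
p_value <- 2 * pt(-abs(t_stat), df = resid_df)   # two-sided p-value

round(c(t_stat = t_stat, t_crit = t_crit), 3)
p_value
```

With this many degrees of freedom the critical value is very close to 1.96, which is where the "greater than 2" rule of thumb comes from.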

prop_type_simplified is statistically significant in predicting the price, since all of the property types (entire loft, entire rental unit, other, and private room in rental unit) have an absolute t-value greater than 2. According to their signs, relative to the baseline category (entire condominium), entire lofts and entire rental units are associated with higher prices, while private rooms in rental units and other property types are associated with lower prices. Among all the property types, private room in rental unit has the largest effect on price, as shown by the size of its coefficient (-0.389, with a t-value of -15.046).
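Because the response is log(price_4_nights), each property-type coefficient b corresponds to a multiplicative change of exp(b) in price relative to the baseline level (entire condominium, the level that does not appear in the table). The sketch below converts the reported model1 coefficients into approximate percentage differences; it only restates the published estimates and does not refit the model.

```r
# model1 coefficients on the log price scale, copied from the summary above
coefs <- c("Entire loft"                 =  0.1731,
           "Entire rental unit"          =  0.09364,
           "Other"                       = -0.09258,
           "Private room in rental unit" = -0.3895)

# Approximate percentage difference in price relative to the baseline category
pct_effect <- round(100 * (exp(coefs) - 1), 1)
pct_effect
```

Read this way, a private room in a rental unit is priced roughly a third below an entire condominium, holding the other predictors fixed.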

We want to determine if room_type is a significant predictor of the cost for 4 nights, given everything else in the model. We have created a regression model called model2 that includes all of the explanatory variables in model1 plus room_type.

model2 <- lm(price_4_nights ~ 
               prop_type_simplified + 
               number_of_reviews + 
               review_scores_rating + 
               room_type, 
             data = log_listings_4_nights_2_people)
  
log_listings_4_nights_2_people %>%
  group_by(room_type) %>%
  summarise(count=n())
room_type        count
Entire home/apt  11335
Hotel room          61
Private room      2581
Shared room         84
autoplot(model2)+ theme_bw()

get_regression_table(model2) 
term                                               estimate  std_error  statistic  p_value  lower_ci  upper_ci
intercept                                             6.02       0.039     156       0         5.94      6.09
prop_type_simplified: Entire loft                     0.172      0.032       5.44    0         0.11      0.234
prop_type_simplified: Entire rental unit              0.093      0.021       4.54    0         0.053     0.133
prop_type_simplified: Other                           0.327      0.036       9.11    0         0.256     0.397
prop_type_simplified: Private room in rental unit     0.274      0.047       5.89    0         0.183     0.366
number_of_reviews                                    -0.001      0         -14.3     0        -0.001    -0.001
review_scores_rating                                 -0.029      0.007      -4.06    0        -0.044    -0.015
room_type: Hotel room                                 0.194      0.091       2.13    0.033     0.016     0.372
room_type: Private room                              -0.664      0.039     -17       0        -0.741    -0.588
room_type: Shared room                               -1.26       0.083     -15.2     0        -1.42     -1.1
get_regression_summaries(model2)
r_squared  adj_r_squared    mse   rmse  sigma  statistic  p_value  df      nobs
    0.118          0.118  0.352  0.593  0.594        162        0   9  1.09e+04
mosaic::msummary(model2)
                                                  Estimate Std. Error t value
(Intercept)                                      6.016e+00  3.857e-02 155.950
prop_type_simplifiedEntire loft                  1.724e-01  3.168e-02   5.441
prop_type_simplifiedEntire rental unit           9.317e-02  2.052e-02   4.541
prop_type_simplifiedOther                        3.267e-01  3.585e-02   9.112
prop_type_simplifiedPrivate room in rental unit  2.744e-01  4.658e-02   5.891
number_of_reviews                               -1.160e-03  8.128e-05 -14.271
review_scores_rating                            -2.938e-02  7.228e-03  -4.064
room_typeHotel room                              1.939e-01  9.089e-02   2.134
room_typePrivate room                           -6.641e-01  3.908e-02 -16.995
room_typeShared room                            -1.262e+00  8.304e-02 -15.201
                                                Pr(>|t|)    
(Intercept)                                      < 2e-16 ***
prop_type_simplifiedEntire loft                 5.41e-08 ***
prop_type_simplifiedEntire rental unit          5.67e-06 ***
prop_type_simplifiedOther                        < 2e-16 ***
prop_type_simplifiedPrivate room in rental unit 3.95e-09 ***
number_of_reviews                                < 2e-16 ***
review_scores_rating                            4.85e-05 ***
room_typeHotel room                               0.0329 *  
room_typePrivate room                            < 2e-16 ***
room_typeShared room                             < 2e-16 ***

Residual standard error: 0.5936 on 10861 degrees of freedom
  (3190 observations deleted due to missingness)
Multiple R-squared:  0.1183,    Adjusted R-squared:  0.1176 
F-statistic: 161.9 on 9 and 10861 DF,  p-value: < 2.2e-16
car::vif(model2)  
                         GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 7.399788  4        1.284258
number_of_reviews    1.014343  1        1.007146
review_scores_rating 1.010966  1        1.005468
room_type            7.377840  3        1.395259

Comment: After running model 2, we found that all room types (hotel room, private room, shared room) are statistically significant at the 5% level in explaining price, since the absolute values of their t-statistics all exceed 2. Specifically, hotel rooms are associated with higher rental prices than the baseline (entire home/apt), while private rooms and shared rooms are associated with lower prices, for the reasons discussed in the EDA.

However, after adding room_type, the coefficients of “Private room in rental unit” and “Other” property types changed from negative to positive. It is therefore reasonable to ask whether adding the new variable has affected the explanatory power of the original one. The VIF output answers this: prop_type_simplified and room_type are collinear, as their GVIF values (about 7.4 each) are greater than 5.

Having established that they are collinear, we want to determine which one to keep for the rest of the analysis. Therefore, in model 2.2, we drop prop_type_simplified and compare the result with model 1.
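
For terms with more than one degree of freedom, car::vif() reports a generalized VIF, and its third column, GVIF^(1/(2*Df)), is the figure intended to be comparable across terms. The following is a minimal sketch, using only the values printed above, of how that adjusted figure is obtained and how squaring it puts it on the scale of an ordinary VIF. How much weight to give the raw GVIF versus the adjusted figure is a judgment call; the sign changes noted above remain evidence that the two variables overlap.

```r
# GVIF and Df values copied from the car::vif(model2) output above
gvif <- c(prop_type_simplified = 7.399788, room_type = 7.377840)
df   <- c(prop_type_simplified = 4,        room_type = 3)

# Dimension-adjusted GVIF, as printed in the third column of the output
adj_gvif <- gvif^(1 / (2 * df))

# Squaring puts the adjusted value on the same scale as an ordinary VIF
vif_scale <- adj_gvif^2

round(rbind(adj_gvif, vif_scale), 3)
```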

model2.2 <- lm(price_4_nights ~ 
               number_of_reviews + 
               review_scores_rating + 
               room_type, 
             data = log_listings_4_nights_2_people)

autoplot(model2.2)+ theme_bw()

get_regression_table(model2.2) 
term                     estimate   std_error  statistic  p_value    lower_ci   upper_ci
intercept                6.12       0.034      181        0          6.05       6.18
number_of_reviews        -0.001     0          -14.2      0          -0.001     -0.001
review_scores_rating     -0.03      0.007      -4.19      0          -0.045     -0.016
room_type: Hotel room    0.422      0.086      4.89       0          0.253      0.591
room_type: Private room  -0.472     0.015      -31.1      0          -0.502     -0.442
room_type: Shared room   -1.03      0.078      -13.3      0          -1.19      -0.881
get_regression_summaries(model2.2)
r_squared  adj_r_squared  mse    rmse   sigma  statistic  p_value  df  nobs
0.111      0.11           0.355  0.596  0.596  270        0        5   1.09e+04
mosaic::msummary(model2.2)
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            6.119e+00  3.385e-02 180.731  < 2e-16 ***
number_of_reviews     -1.154e-03  8.141e-05 -14.177  < 2e-16 ***
review_scores_rating  -3.041e-02  7.256e-03  -4.191 2.80e-05 ***
room_typeHotel room    4.221e-01  8.633e-02   4.890 1.02e-06 ***
room_typePrivate room -4.719e-01  1.518e-02 -31.089  < 2e-16 ***
room_typeShared room  -1.034e+00  7.794e-02 -13.269  < 2e-16 ***

Residual standard error: 0.5961 on 10865 degrees of freedom
  (3190 observations deleted due to missingness)
Multiple R-squared:  0.1106,    Adjusted R-squared:  0.1101 
F-statistic: 270.1 on 5 and 10865 DF,  p-value: < 2.2e-16
car::vif(model2.2)  
                         GVIF Df GVIF^(1/(2*Df))
number_of_reviews    1.009077  1        1.004528
review_scores_rating 1.010210  1        1.005092
room_type            1.003841  3        1.000639

Comment: After running model 2.2, we found that the explanatory power of room_type is much stronger than that of prop_type_simplified, as the adjusted R-squared is roughly 0.03 higher than in model 1. Therefore, we keep only room_type in the following analysis.

3.1 Further variables/questions to explore on our own

Our dataset has many more variables, so here are some ideas on how we can extend our analysis.

Q1. Are the number of bathrooms, bedrooms, beds, or size of the house (accommodates) significant predictors of price_4_nights? Or might these be collinear variables?

First, we need to convert the bathrooms variable from text to a numeric type so that it can be used in the analysis.

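# Extract the numeric part of bathrooms_text (e.g. "1.5 baths" -> 1.5)
log_listings_4_nights_2_people <- log_listings_4_nights_2_people %>%
  mutate(bathrooms_clean = parse_number(bathrooms_text))

# Correlation matrix of price and the size variables; use the parsed
# bathrooms_clean column and ignore missing values pair by pair
correlation_matrix_data_2 <- log_listings_4_nights_2_people %>% 
  select(price, bedrooms, bathrooms_clean, beds)
corr <- round(cor(correlation_matrix_data_2, use = "pairwise.complete.obs"), 1)
ggcorrplot(corr)

log_listings_4_nights_2_people %>% #Correlation between price and the size variables 
  select(price, bathrooms_clean, bedrooms, beds, accommodates) %>% 
  ggpairs(aes(alpha=0.2))+
  theme_bw()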

model3 <- lm(price_4_nights ~ #Including bathrooms, beds, bedrooms and accommodates as explanatory variables 
               number_of_reviews + 
               review_scores_rating + 
               room_type+
               bathrooms_clean+
               bedrooms+
               beds+
               accommodates, 
             data = log_listings_4_nights_2_people)

autoplot(model3)+ theme_bw()

get_regression_table(model3) 
term                     estimate   std_error  statistic  p_value    lower_ci   upper_ci
intercept                5.44       0.037      146        0          5.37       5.51
number_of_reviews        -0.001     0          -15.2      0          -0.001     -0.001
review_scores_rating     -0.027     0.007      -3.78      0          -0.04      -0.013
room_type: Hotel room    0.497      0.083      5.98       0          0.334      0.66
room_type: Private room  -0.365     0.016      -23.2      0          -0.396     -0.334
room_type: Shared room   -0.919     0.073      -12.5      0          -1.06      -0.775
bathrooms_clean          0.246      0.017      14.7       0          0.213      0.279
bedrooms                 0.185      0.015      12.3       0          0.156      0.215
beds                     -0.02      0.007      -2.81      0.005      -0.034     -0.006
accommodates             0.054      0.006      8.44       0          0.041      0.066
get_regression_summaries(model3)
r_squared  adj_r_squared  mse    rmse   sigma  statistic  p_value  df  nobs
0.246      0.245          0.308  0.555  0.555  360        0        9   9.97e+03
mosaic::msummary(model3)
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            5.442e+00  3.734e-02 145.762  < 2e-16 ***
number_of_reviews     -1.214e-03  8.014e-05 -15.151  < 2e-16 ***
review_scores_rating  -2.659e-02  7.033e-03  -3.781 0.000157 ***
room_typeHotel room    4.972e-01  8.310e-02   5.983 2.27e-09 ***
room_typePrivate room -3.649e-01  1.570e-02 -23.247  < 2e-16 ***
room_typeShared room  -9.194e-01  7.347e-02 -12.513  < 2e-16 ***
bathrooms_clean        2.463e-01  1.681e-02  14.658  < 2e-16 ***
bedrooms               1.853e-01  1.504e-02  12.321  < 2e-16 ***
beds                  -1.976e-02  7.030e-03  -2.810 0.004962 ** 
accommodates           5.384e-02  6.377e-03   8.442  < 2e-16 ***

Residual standard error: 0.5552 on 9961 degrees of freedom
  (4090 observations deleted due to missingness)
Multiple R-squared:  0.2456,    Adjusted R-squared:  0.2449 
F-statistic: 360.3 on 9 and 9961 DF,  p-value: < 2.2e-16
car::vif(model3)  
                         GVIF Df GVIF^(1/(2*Df))
number_of_reviews    1.011838  1        1.005901
review_scores_rating 1.010564  1        1.005268
room_type            1.189674  3        1.029370
bathrooms_clean      1.646086  1        1.282999
bedrooms             2.383258  1        1.543780
beds                 2.276964  1        1.508961
accommodates         2.857174  1        1.690318

Comments: No GVIF value in this regression is above 5. However, the correlation analysis above shows high correlations among the four variables bedrooms, bathrooms_clean, beds and accommodates, which is intuitively reasonable since they all reflect the size of the property. Therefore, to arrive at a regression model that is as powerful as possible without redundant predictors, we decided to keep only one of the four variables going forward.

To decide which of bathrooms, bedrooms, beds and accommodates to keep, we fit one model for each candidate.

model3.2 <- lm(price_4_nights ~ #keep bathrooms
               number_of_reviews + 
               review_scores_rating + 
               bathrooms_clean+
               room_type, 
             data = log_listings_4_nights_2_people)

mosaic::msummary(model3.2)
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            5.5698793  0.0355384 156.729  < 2e-16 ***
number_of_reviews     -0.0011674  0.0000771 -15.141  < 2e-16 ***
review_scores_rating  -0.0301376  0.0068698  -4.387 1.16e-05 ***
bathrooms_clean        0.4735712  0.0132128  35.842  < 2e-16 ***
room_typeHotel room    0.4857933  0.0833597   5.828 5.78e-09 ***
room_typePrivate room -0.4471458  0.0143990 -31.054  < 2e-16 ***
room_typeShared room  -1.0002917  0.0743096 -13.461  < 2e-16 ***

Residual standard error: 0.5635 on 10845 degrees of freedom
  (3209 observations deleted due to missingness)
Multiple R-squared:  0.2038,    Adjusted R-squared:  0.2033 
F-statistic: 462.6 on 6 and 10845 DF,  p-value: < 2.2e-16
car::vif(model3.2) 
                         GVIF Df GVIF^(1/(2*Df))
number_of_reviews    1.009105  1        1.004542
review_scores_rating 1.010265  1        1.005119
bathrooms_clean      1.002381  1        1.001190
room_type            1.006311  3        1.001049
model3.3 <- lm(price_4_nights ~ #keep bedrooms
               number_of_reviews + 
               review_scores_rating + 
               bedrooms+
               room_type, 
             data = log_listings_4_nights_2_people)

mosaic::msummary(model3.3)
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            5.654e+00  3.593e-02 157.375  < 2e-16 ***
number_of_reviews     -1.175e-03  8.117e-05 -14.476  < 2e-16 ***
review_scores_rating  -2.729e-02  7.110e-03  -3.838 0.000125 ***
bedrooms               3.611e-01  1.009e-02  35.789  < 2e-16 ***
room_typeHotel room    4.406e-01  8.263e-02   5.331 9.95e-08 ***
room_typePrivate room -3.971e-01  1.487e-02 -26.710  < 2e-16 ***
room_typeShared room  -9.434e-01  7.392e-02 -12.762  < 2e-16 ***

Residual standard error: 0.5645 on 10021 degrees of freedom
  (4033 observations deleted due to missingness)
Multiple R-squared:  0.2213,    Adjusted R-squared:  0.2208 
F-statistic: 474.7 on 6 and 10021 DF,  p-value: < 2.2e-16
car::vif(model3.3) 
                         GVIF Df GVIF^(1/(2*Df))
number_of_reviews    1.009046  1        1.004513
review_scores_rating 1.010345  1        1.005159
bedrooms             1.038187  1        1.018915
room_type            1.041589  3        1.006814
model3.4 <- lm(price_4_nights ~ #keep beds
               number_of_reviews + 
               review_scores_rating + 
               beds+
               room_type, 
             data = log_listings_4_nights_2_people)

mosaic::msummary(model3.4)
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            5.876e+00  3.443e-02 170.650  < 2e-16 ***
number_of_reviews     -1.219e-03  7.926e-05 -15.386  < 2e-16 ***
review_scores_rating  -2.814e-02  7.085e-03  -3.973 7.16e-05 ***
beds                   1.209e-01  4.862e-03  24.863  < 2e-16 ***
room_typeHotel room    4.244e-01  8.398e-02   5.054 4.40e-07 ***
room_typePrivate room -3.949e-01  1.513e-02 -26.104  < 2e-16 ***
room_typeShared room  -9.966e-01  7.584e-02 -13.141  < 2e-16 ***

Residual standard error: 0.5799 on 10819 degrees of freedom
  (3235 observations deleted due to missingness)
Multiple R-squared:  0.1581,    Adjusted R-squared:  0.1576 
F-statistic: 338.5 on 6 and 10819 DF,  p-value: < 2.2e-16
car::vif(model3.4)
                         GVIF Df GVIF^(1/(2*Df))
number_of_reviews    1.009965  1        1.004970
review_scores_rating 1.010254  1        1.005114
beds                 1.042868  1        1.021209
room_type            1.045573  3        1.007455
model3.5 <- lm(price_4_nights ~ #keep accommodates
               number_of_reviews + 
               review_scores_rating + 
               accommodates+
               room_type, 
             data = log_listings_4_nights_2_people)

mosaic::msummary(model3.5)
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            5.640e+00  3.516e-02 160.432  < 2e-16 ***
number_of_reviews     -1.251e-03  7.747e-05 -16.145  < 2e-16 ***
review_scores_rating  -2.720e-02  6.901e-03  -3.941 8.16e-05 ***
accommodates           1.344e-01  3.967e-03  33.892  < 2e-16 ***
room_typeHotel room    4.792e-01  8.212e-02   5.836 5.49e-09 ***
room_typePrivate room -2.999e-01  1.530e-02 -19.600  < 2e-16 ***
room_typeShared room  -8.821e-01  7.426e-02 -11.878  < 2e-16 ***

Residual standard error: 0.5669 on 10864 degrees of freedom
  (3190 observations deleted due to missingness)
Multiple R-squared:  0.1956,    Adjusted R-squared:  0.1952 
F-statistic: 440.3 on 6 and 10864 DF,  p-value: < 2.2e-16
car::vif(model3.5)
                         GVIF Df GVIF^(1/(2*Df))
number_of_reviews    1.010446  1        1.005209
review_scores_rating 1.010401  1        1.005187
accommodates         1.128111  1        1.062126
room_type            1.130313  3        1.020626

Comments: The adjusted R-squared values for models 3.2, 3.3, 3.4 and 3.5 are 0.2033, 0.2208, 0.1576 and 0.1952 respectively. Since model 3.3 has the highest value, we keep bedrooms and exclude the other three variables.
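
One caveat is that models 3.2 to 3.5 drop different numbers of rows for missingness (3209, 4033, 3235 and 3190), so their adjusted R-squared values are computed on slightly different samples. The following is a minimal sketch, on simulated stand-in data rather than the Airbnb listings, of how candidate models could be refitted on a common set of complete rows before comparing them. The data frame `toy` and its values are purely illustrative assumptions.

```r
# Simulated stand-in data (NOT the Airbnb listings): two candidate predictors
# with different patterns of missing values
set.seed(1)
n <- 200
toy <- data.frame(bedrooms = rpois(n, 1) + 1)
toy$beds  <- toy$bedrooms + rpois(n, 1)
toy$price <- 5.5 + 0.35 * toy$bedrooms + rnorm(n, sd = 0.5)
toy$bedrooms[sample(n, 30)] <- NA   # bedrooms missing for some rows
toy$beds[sample(n, 10)]     <- NA   # beds missing for other rows

# Keep only rows where every candidate predictor is observed
common <- toy[complete.cases(toy[, c("price", "bedrooms", "beds")]), ]

fit_bedrooms <- lm(price ~ bedrooms, data = common)
fit_beds     <- lm(price ~ beds,     data = common)

# Both fits now use the same rows, so adjusted R-squared is directly comparable
c(n_bedrooms = nobs(fit_bedrooms), n_beds = nobs(fit_beds))
c(adj_r2_bedrooms = summary(fit_bedrooms)$adj.r.squared,
  adj_r2_beds     = summary(fit_beds)$adj.r.squared)
```

Applied to the real data, the same idea would mean filtering log_listings_4_nights_2_people to rows with no missing values in any of the four size variables before fitting models 3.2 to 3.5.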

model4 <- lm(price_4_nights ~ #removing bathrooms, beds, and accommodates to correct for the effect of multi-collinearity among these variables 
               number_of_reviews + 
               review_scores_rating + 
               room_type+
               bedrooms, 
             data = log_listings_4_nights_2_people)

autoplot(model4)+ theme_bw()

get_regression_table(model4) 
term                     estimate   std_error  statistic  p_value    lower_ci   upper_ci
intercept                5.65       0.036      157        0          5.58       5.72
number_of_reviews        -0.001     0          -14.5      0          -0.001     -0.001
review_scores_rating     -0.027     0.007      -3.84      0          -0.041     -0.013
room_type: Hotel room    0.441      0.083      5.33       0          0.279      0.603
room_type: Private room  -0.397     0.015      -26.7      0          -0.426     -0.368
room_type: Shared room   -0.943     0.074      -12.8      0          -1.09      -0.798
bedrooms                 0.361      0.01       35.8       0          0.341      0.381
get_regression_summaries(model4)
r_squared  adj_r_squared  mse    rmse   sigma  statistic  p_value  df  nobs
0.221      0.221          0.318  0.564  0.565  475        0        6   1e+04
mosaic::msummary(model4)
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            5.654e+00  3.593e-02 157.375  < 2e-16 ***
number_of_reviews     -1.175e-03  8.117e-05 -14.476  < 2e-16 ***
review_scores_rating  -2.729e-02  7.110e-03  -3.838 0.000125 ***
room_typeHotel room    4.406e-01  8.263e-02   5.331 9.95e-08 ***
room_typePrivate room -3.971e-01  1.487e-02 -26.710  < 2e-16 ***
room_typeShared room  -9.434e-01  7.392e-02 -12.762  < 2e-16 ***
bedrooms               3.611e-01  1.009e-02  35.789  < 2e-16 ***

Residual standard error: 0.5645 on 10021 degrees of freedom
  (4033 observations deleted due to missingness)
Multiple R-squared:  0.2213,    Adjusted R-squared:  0.2208 
F-statistic: 474.7 on 6 and 10021 DF,  p-value: < 2.2e-16
car::vif(model4)  
                         GVIF Df GVIF^(1/(2*Df))
number_of_reviews    1.009046  1        1.004513
review_scores_rating 1.010345  1        1.005159
room_type            1.041589  3        1.006814
bedrooms             1.038187  1        1.018915

Q2. Do superhosts (host_is_superhost) command a pricing premium, after controlling for other variables?

model5 <- lm(price_4_nights ~ #adding host_is_superhost 
               number_of_reviews + 
               review_scores_rating + 
               room_type+
               bedrooms+
               host_is_superhost, 
             data = log_listings_4_nights_2_people)

autoplot(model5)+ theme_bw()

get_regression_table(model5) 
term                     estimate   std_error  statistic  p_value    lower_ci   upper_ci
intercept                5.64       0.036      157        0          5.57       5.71
number_of_reviews        -0.001     0          -13.2      0          -0.001     -0.001
review_scores_rating     -0.023     0.007      -3.23      0.001      -0.037     -0.009
room_type: Hotel room    0.449      0.083      5.43       0          0.287      0.611
room_type: Private room  -0.399     0.015      -26.8      0          -0.428     -0.37
room_type: Shared room   -0.949     0.074      -12.8      0          -1.09      -0.804
bedrooms                 0.361      0.01       35.9       0          0.342      0.381
host_is_superhostTRUE    -0.054     0.014      -3.81      0          -0.082     -0.026
get_regression_summaries(model5)
r_squared  adj_r_squared  mse    rmse   sigma  statistic  p_value  df  nobs
0.222      0.222          0.318  0.564  0.564  410        0        7   1e+04
mosaic::msummary(model5)
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            5.6441760  0.0359902 156.826  < 2e-16 ***
number_of_reviews     -0.0010985  0.0000835 -13.156  < 2e-16 ***
review_scores_rating  -0.0232280  0.0071872  -3.232 0.001234 ** 
room_typeHotel room    0.4487310  0.0825965   5.433 5.68e-08 ***
room_typePrivate room -0.3987332  0.0148617 -26.830  < 2e-16 ***
room_typeShared room  -0.9492805  0.0738752 -12.850  < 2e-16 ***
bedrooms               0.3614135  0.0100810  35.851  < 2e-16 ***
host_is_superhostTRUE -0.0544075  0.0142720  -3.812 0.000139 ***

Residual standard error: 0.5641 on 10019 degrees of freedom
  (4034 observations deleted due to missingness)
Multiple R-squared:  0.2225,    Adjusted R-squared:  0.2219 
F-statistic: 409.6 on 7 and 10019 DF,  p-value: < 2.2e-16
car::vif(model5)  
                         GVIF Df GVIF^(1/(2*Df))
number_of_reviews    1.069533  1        1.034182
review_scores_rating 1.033888  1        1.016803
room_type            1.043856  3        1.007179
bedrooms             1.038241  1        1.018941
host_is_superhost    1.093506  1        1.045708

Comments: After adding host_is_superhost, we observed that it is a significant predictor of price (the absolute value of its t-statistic is greater than 2), with a small negative coefficient (-0.054). Superhost status is based on the quality of service provided, and guests generally treat value for money as a key factor when reviewing hosts. Superhosts may therefore charge relatively lower prices for the same level of quality, which could explain the negative relationship between the two variables.

We also see that the adjusted R-squared increased slightly, from 0.2208 to 0.2219, after adding superhost status, which indicates that the variable helps the model explain a little more of the variation in prices in Milan.

Q3. Some hosts allow you to immediately book their listing (instant_bookable == TRUE), while a non-trivial proportion don’t. After controlling for other variables, is instant_bookable a significant predictor of price_4_nights?

model6 <- lm(price_4_nights ~ #adding instant_bookable 
               number_of_reviews + 
               review_scores_rating + 
               room_type+
               bedrooms+
               host_is_superhost+
               instant_bookable, 
             data = log_listings_4_nights_2_people)

autoplot(model6)+ theme_bw()

get_regression_table(model6) 
term                     estimate   std_error  statistic  p_value    lower_ci   upper_ci
intercept                5.59       0.037      152        0          5.51       5.66
number_of_reviews        -0.001     0          -13.8      0          -0.001     -0.001
review_scores_rating     -0.019     0.007      -2.65      0.008      -0.033     -0.005
room_type: Hotel room    0.406      0.083      4.91       0          0.244      0.567
room_type: Private room  -0.384     0.015      -25.6      0          -0.413     -0.354
room_type: Shared room   -0.937     0.074      -12.7      0          -1.08      -0.792
bedrooms                 0.362      0.01       36.1       0          0.343      0.382
host_is_superhostTRUE    -0.061     0.014      -4.25      0          -0.089     -0.033
instant_bookableTRUE     0.089      0.012      7.67       0          0.066      0.111
get_regression_summaries(model6)
r_squared  adj_r_squared  mse    rmse   sigma  statistic  p_value  df  nobs
0.227      0.226          0.316  0.562  0.562  368        0        8   1e+04
mosaic::msummary(model6)
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            5.586e+00  3.669e-02 152.258  < 2e-16 ***
number_of_reviews     -1.151e-03  8.354e-05 -13.774  < 2e-16 ***
review_scores_rating  -1.904e-02  7.187e-03  -2.649  0.00809 ** 
room_typeHotel room    4.055e-01  8.255e-02   4.912 9.15e-07 ***
room_typePrivate room -3.835e-01  1.495e-02 -25.649  < 2e-16 ***
room_typeShared room  -9.368e-01  7.368e-02 -12.715  < 2e-16 ***
bedrooms               3.625e-01  1.005e-02  36.057  < 2e-16 ***
host_is_superhostTRUE -6.057e-02  1.425e-02  -4.249 2.17e-05 ***
instant_bookableTRUE   8.872e-02  1.157e-02   7.668 1.90e-14 ***

Residual standard error: 0.5625 on 10018 degrees of freedom
  (4034 observations deleted due to missingness)
Multiple R-squared:  0.227, Adjusted R-squared:  0.2264 
F-statistic: 367.8 on 8 and 10018 DF,  p-value: < 2.2e-16
car::vif(model6)
                         GVIF Df GVIF^(1/(2*Df))
number_of_reviews    1.076657  1        1.037621
review_scores_rating 1.039901  1        1.019755
room_type            1.068460  3        1.011097
bedrooms             1.038440  1        1.019039
host_is_superhost    1.096988  1        1.047372
instant_bookable     1.041074  1        1.020330

Comments: After adding the instant_bookable variable, the adjusted R-squared increased further, from 0.2219 to 0.2264. We conclude that the new model explains more of the variation in prices than model 5.

The output shows that instant_bookable is statistically significant with a positive coefficient of 0.089, indicating a positive relationship between being instantly bookable and price. There are two plausible reasons. Firstly, instant booking gives customers more flexibility and saves them the time spent waiting for approval; customers with urgent needs tend to have a higher willingness to pay, which results in relatively higher prices for these listings. Secondly, the feature requires hosts to respond quickly and to be extremely flexible when instant bookings arrive, which drives up their operating costs and therefore their prices.

We therefore keep this variable in the regression model for the remaining analysis.
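
To make the size of these effects easier to read, the coefficients can be converted to approximate percentage changes. This is a minimal sketch, assuming that price_4_nights is on the natural-log scale (as the data frame name log_listings_4_nights_2_people suggests, though the transformation itself is not shown in this section). Under that assumption, the superhost coefficient corresponds to prices roughly 5% lower and the instant_bookable coefficient to prices roughly 9% higher, holding the other variables constant.

```r
# Coefficients copied from the model 5 and model 6 output above
coefs <- c(host_is_superhostTRUE = -0.0544075,  # model 5
           instant_bookableTRUE  =  0.08872)    # model 6

# Assumption: the response is log(price), so exp(b) - 1 is the
# proportional change in price for a one-unit change in the predictor
pct_change <- (exp(coefs) - 1) * 100
round(pct_change, 1)
```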

Q4. Is neighbourhood_simplified a predictor of price_4_nights?

model7 <- lm(price_4_nights ~ #Adding neighbourhood_simplified
               number_of_reviews + 
               review_scores_rating + 
               room_type+
               bedrooms+
               host_is_superhost+
               instant_bookable+
               neighbourhood_simplified, 
             data = log_listings_4_nights_2_people)

autoplot(model7)+ theme_bw()

get_regression_table(model7) 
term                                 estimate   std_error  statistic  p_value    lower_ci   upper_ci
intercept                            5.62       0.037      151        0          5.55       5.7
number_of_reviews                    -0.001     0          -14.2      0          -0.001     -0.001
review_scores_rating                 -0.019     0.007      -2.59      0.01       -0.033     -0.005
room_type: Hotel room                0.385      0.082      4.67       0          0.223      0.546
room_type: Private room              -0.379     0.015      -25.4      0          -0.408     -0.35
room_type: Shared room               -0.943     0.073      -12.8      0          -1.09      -0.799
bedrooms                             0.364      0.01       36.3       0          0.344      0.383
host_is_superhostTRUE                -0.063     0.014      -4.42      0          -0.091     -0.035
instant_bookableTRUE                 0.086      0.012      7.48       0          0.064      0.109
neighbourhood_simplified: Northwest  -0.138     0.016      -8.58      0          -0.169     -0.106
neighbourhood_simplified: Southeast  -0.043     0.014      -3.1       0.002      -0.071     -0.016
neighbourhood_simplified: Southwest  -0.048     0.018      -2.68      0.007      -0.084     -0.013
get_regression_summaries(model7)
r_squared  adj_r_squared  mse    rmse   sigma  statistic  p_value  df  nobs
0.233      0.232          0.314  0.56   0.56   276        0        11  1e+04
mosaic::msummary(model7)
                                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)                        5.624e+00  3.715e-02 151.379  < 2e-16 ***
number_of_reviews                 -1.183e-03  8.339e-05 -14.183  < 2e-16 ***
review_scores_rating              -1.856e-02  7.164e-03  -2.591  0.00959 ** 
room_typeHotel room                3.846e-01  8.232e-02   4.673 3.01e-06 ***
room_typePrivate room             -3.790e-01  1.492e-02 -25.406  < 2e-16 ***
room_typeShared room              -9.425e-01  7.342e-02 -12.837  < 2e-16 ***
bedrooms                           3.637e-01  1.002e-02  36.288  < 2e-16 ***
host_is_superhostTRUE             -6.279e-02  1.421e-02  -4.420 9.96e-06 ***
instant_bookableTRUE               8.645e-02  1.156e-02   7.479 8.11e-14 ***
neighbourhood_simplifiedNorthwest -1.377e-01  1.605e-02  -8.579  < 2e-16 ***
neighbourhood_simplifiedSoutheast -4.325e-02  1.395e-02  -3.099  0.00194 ** 
neighbourhood_simplifiedSouthwest -4.835e-02  1.801e-02  -2.684  0.00728 ** 

Residual standard error: 0.5605 on 10015 degrees of freedom
  (4034 observations deleted due to missingness)
Multiple R-squared:  0.2327,    Adjusted R-squared:  0.2319 
F-statistic: 276.1 on 11 and 10015 DF,  p-value: < 2.2e-16
car::vif(model7)
                             GVIF Df GVIF^(1/(2*Df))
number_of_reviews        1.080520  1        1.039481
review_scores_rating     1.040350  1        1.019976
room_type                1.072696  3        1.011765
bedrooms                 1.039201  1        1.019412
host_is_superhost        1.097400  1        1.047569
instant_bookable         1.046336  1        1.022906
neighbourhood_simplified 1.016307  3        1.002700

Comments: After running model 7, we found that neighbourhood also has explanatory power for price at the 5% significance level. Specifically, listings in the Northwest, Southeast and Southwest tend to have lower prices than those in the Northeast, which is the baseline region.

Q5. What is the effect of availability_30 or reviews_per_month on price_4_nights, after we control for other variables?

model8 <- lm(price_4_nights ~ #Adding availability_30
               number_of_reviews + 
               review_scores_rating + 
               room_type+
               bedrooms+
               host_is_superhost+
               instant_bookable+
               neighbourhood_simplified+
               availability_30, 
             data = log_listings_4_nights_2_people)

autoplot(model8)+ theme_bw()

get_regression_table(model8) 
term                                 estimate   std_error  statistic  p_value    lower_ci   upper_ci
intercept                            5.47       0.036      153        0          5.4        5.54
number_of_reviews                    -0.001     0          -13.8      0          -0.001     -0.001
review_scores_rating                 -0.017     0.007      -2.54      0.011      -0.031     -0.004
room_type: Hotel room                0.237      0.079      3.01       0.003      0.083      0.391
room_type: Private room              -0.404     0.014      -28.4      0          -0.432     -0.376
room_type: Shared room               -0.964     0.07       -13.8      0          -1.1       -0.827
bedrooms                             0.362      0.01       38         0          0.344      0.381
host_is_superhostTRUE                -0.045     0.014      -3.35      0.001      -0.072     -0.019
instant_bookableTRUE                 0.111      0.011      10.1       0          0.089      0.133
neighbourhood_simplified: Northwest  -0.132     0.015      -8.64      0          -0.162     -0.102
neighbourhood_simplified: Southeast  -0.032     0.013      -2.43      0.015      -0.058     -0.006
neighbourhood_simplified: Southwest  -0.046     0.017      -2.71      0.007      -0.08      -0.013
availability_30                      0.017      0.001      32         0          0.016      0.018
get_regression_summaries(model8)
r_squared  adj_r_squared  mse    rmse   sigma  statistic  p_value  df  nobs
0.304      0.303          0.285  0.534  0.534  364        0        12  1e+04
mosaic::msummary(model8)
                                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)                        5.469e+00  3.572e-02 153.118  < 2e-16 ***
number_of_reviews                 -1.095e-03  7.949e-05 -13.776  < 2e-16 ***
review_scores_rating              -1.731e-02  6.825e-03  -2.536 0.011234 *  
room_typeHotel room                2.366e-01  7.856e-02   3.012 0.002606 ** 
room_typePrivate room             -4.039e-01  1.423e-02 -28.383  < 2e-16 ***
room_typeShared room              -9.642e-01  6.995e-02 -13.784  < 2e-16 ***
bedrooms                           3.624e-01  9.547e-03  37.958  < 2e-16 ***
host_is_superhostTRUE             -4.532e-02  1.354e-02  -3.346 0.000823 ***
instant_bookableTRUE               1.110e-01  1.104e-02  10.053  < 2e-16 ***
neighbourhood_simplifiedNorthwest -1.321e-01  1.530e-02  -8.636  < 2e-16 ***
neighbourhood_simplifiedSoutheast -3.236e-02  1.330e-02  -2.433 0.014974 *  
neighbourhood_simplifiedSouthwest -4.649e-02  1.716e-02  -2.709 0.006759 ** 
availability_30                    1.671e-02  5.229e-04  31.951  < 2e-16 ***

Residual standard error: 0.534 on 10014 degrees of freedom
  (4034 observations deleted due to missingness)
Multiple R-squared:  0.3037,    Adjusted R-squared:  0.3029 
F-statistic:   364 on 12 and 10014 DF,  p-value: < 2.2e-16
car::vif(model8)
                             GVIF Df GVIF^(1/(2*Df))
number_of_reviews        1.081808  1        1.040100
review_scores_rating     1.040385  1        1.019993
room_type                1.079578  3        1.012843
bedrooms                 1.039219  1        1.019421
host_is_superhost        1.099192  1        1.048423
instant_bookable         1.051417  1        1.025386
neighbourhood_simplified 1.017011  3        1.002815
availability_30          1.017988  1        1.008954

Comments: After adding the variable availability_30, the R-squared value increased from 0.233 to 0.304. This is a substantial increase and suggests that model 8 explains more of the variation in prices than model 7. In addition, the t-statistic of 31.95 is a very strong indication that the variable is significant. A likely reason is that cheaper properties are booked first, leaving the more expensive properties available on the site, which explains the positive coefficient.
Given this strong significance, we keep the availability_30 variable in the model.
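
Because model 8 has one more predictor than model 7, it is worth confirming that the improvement survives the penalty for extra terms. As a check, the adjusted \(R^2\) reported by R can be reproduced from the printed \(R^2\), the number of observations \(n = 10027\) and the number of estimated slopes \(p\), using the standard formula:

\[
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
\]

\[
\text{Model 7 } (p = 11):\quad \bar{R}^2 = 1 - (1 - 0.2327)\,\frac{10026}{10015} \approx 0.2319
\]

\[
\text{Model 8 } (p = 12):\quad \bar{R}^2 = 1 - (1 - 0.3037)\,\frac{10026}{10014} \approx 0.3029
\]

Both values match the output above. With roughly 10,000 observations the penalty is very small, so the gain from 0.2319 to 0.3029 is almost entirely due to the added variable.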

Additional Factors That Might Improve the Model: Apart from the variables available in the data frame, other factors might help explain price. One is “distance to Duomo di Milano”: the closer a listing is to the Cathedral, the more expensive it is likely to be, since a central location makes it more convenient for visitors to travel around Milan. This should not lead to collinearity, since the way we group neighbourhoods does not capture distance to central Milan. In addition, “season” is likely to have some explanatory power, since visitor numbers vary by season, which in turn affects the demand for Airbnb listings and hence their price.
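
The distance feature suggested above could be derived from listing coordinates. The following is a minimal sketch, assuming the listings data contain latitude and longitude columns (not shown in this section) and using approximate coordinates for the Duomo. The helper haversine_km() and the data frame listings_toy are hypothetical names introduced for illustration only.

```r
# Approximate coordinates of the Duomo di Milano (assumed, not from the dataset)
duomo_lat <- 45.4642
duomo_lon <- 9.1914

# Great-circle (haversine) distance in kilometres between points given in degrees
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Hypothetical listing coordinates, for illustration only
listings_toy <- data.frame(latitude  = c(45.4642, 45.4850),
                           longitude = c(9.1914, 9.2050))
listings_toy$dist_to_duomo_km <- haversine_km(listings_toy$latitude,
                                              listings_toy$longitude,
                                              duomo_lat, duomo_lon)
listings_toy
```

A column built this way could then be added to the model 8 formula as one more numeric predictor and assessed in the same way as the other variables.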

3.2 Diagnostics, collinearity, summary tables

  1. Create a summary table, using huxtable (https://mfa2022.netlify.app/example/modelling_side_by_side_tables/) that shows which models you worked on, which predictors are significant, the adjusted \(R^2\), and the Residual Standard Error.
huxreg(model1,model2,model3,model4,model5,model6,model7,model8,
       statistics = c('#observations' = 'nobs', 
                      'R squared' = 'r.squared', 
                      'Adj. R Squared' = 'adj.r.squared', 
                      'Residual SE' = 'sigma'), 
       bold_signif = 0.05
       )

| | (1) | (2) | (3) | (4) | (5) | (6) | (7) | (8) |
|---|---|---|---|---|---|---|---|---|
| (Intercept) | 6.013 *** | 6.016 *** | 5.442 *** | 5.654 *** | 5.644 *** | 5.586 *** | 5.624 *** | 5.469 *** |
| | (0.039) | (0.039) | (0.037) | (0.036) | (0.036) | (0.037) | (0.037) | (0.036) |
| prop_type_simplifiedEntire loft | 0.173 *** | 0.172 *** | | | | | | |
| | (0.032) | (0.032) | | | | | | |
| prop_type_simplifiedEntire rental unit | 0.094 *** | 0.093 *** | | | | | | |
| | (0.021) | (0.021) | | | | | | |
| prop_type_simplifiedOther | -0.093 *** | 0.327 *** | | | | | | |
| | (0.027) | (0.036) | | | | | | |
| prop_type_simplifiedPrivate room in rental unit | -0.389 *** | 0.274 *** | | | | | | |
| | (0.026) | (0.047) | | | | | | |
| number_of_reviews | -0.001 *** | -0.001 *** | -0.001 *** | -0.001 *** | -0.001 *** | -0.001 *** | -0.001 *** | -0.001 *** |
| | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) |
| review_scores_rating | -0.029 *** | -0.029 *** | -0.027 *** | -0.027 *** | -0.023 ** | -0.019 ** | -0.019 ** | -0.017 * |
| | (0.007) | (0.007) | (0.007) | (0.007) | (0.007) | (0.007) | (0.007) | (0.007) |
| room_typeHotel room | | 0.194 * | 0.497 *** | 0.441 *** | 0.449 *** | 0.406 *** | 0.385 *** | 0.237 ** |
| | | (0.091) | (0.083) | (0.083) | (0.083) | (0.083) | (0.082) | (0.079) |
| room_typePrivate room | | -0.664 *** | -0.365 *** | -0.397 *** | -0.399 *** | -0.384 *** | -0.379 *** | -0.404 *** |
| | | (0.039) | (0.016) | (0.015) | (0.015) | (0.015) | (0.015) | (0.014) |
| room_typeShared room | | -1.262 *** | -0.919 *** | -0.943 *** | -0.949 *** | -0.937 *** | -0.943 *** | -0.964 *** |
| | | (0.083) | (0.073) | (0.074) | (0.074) | (0.074) | (0.073) | (0.070) |
| bathrooms_clean | | | 0.246 *** | | | | | |
| | | | (0.017) | | | | | |
| bedrooms | | | 0.185 *** | 0.361 *** | 0.361 *** | 0.362 *** | 0.364 *** | 0.362 *** |
| | | | (0.015) | (0.010) | (0.010) | (0.010) | (0.010) | (0.010) |
| beds | | | -0.020 ** | | | | | |
| | | | (0.007) | | | | | |
| accommodates | | | 0.054 *** | | | | | |
| | | | (0.006) | | | | | |
| host_is_superhostTRUE | | | | | -0.054 *** | -0.061 *** | -0.063 *** | -0.045 *** |
| | | | | | (0.014) | (0.014) | (0.014) | (0.014) |
| instant_bookableTRUE | | | | | | 0.089 *** | 0.086 *** | 0.111 *** |
| | | | | | | (0.012) | (0.012) | (0.011) |
| neighbourhood_simplifiedNorthwest | | | | | | | -0.138 *** | -0.132 *** |
| | | | | | | | (0.016) | (0.015) |
| neighbourhood_simplifiedSoutheast | | | | | | | -0.043 ** | -0.032 * |
| | | | | | | | (0.014) | (0.013) |
| neighbourhood_simplifiedSouthwest | | | | | | | -0.048 ** | -0.046 ** |
| | | | | | | | (0.018) | (0.017) |
| availability_30 | | | | | | | | 0.017 *** |
| | | | | | | | | (0.001) |
| #observations | 10871 | 10871 | 9971 | 10028 | 10027 | 10027 | 10027 | 10027 |
| R squared | 0.081 | 0.118 | 0.246 | 0.221 | 0.222 | 0.227 | 0.233 | 0.304 |
| Adj. R Squared | 0.080 | 0.118 | 0.245 | 0.221 | 0.222 | 0.226 | 0.232 | 0.303 |
| Residual SE | 0.606 | 0.594 | 0.555 | 0.565 | 0.564 | 0.562 | 0.560 | 0.534 |
*** p < 0.001; ** p < 0.01; * p < 0.05.
Conclusion: Model 8 is the best-fitting model, with the highest adjusted R-squared (0.303) and the lowest residual standard error (0.534) of the eight models.
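
The comparison behind this conclusion can also be done programmatically rather than by reading the table. The following is a minimal sketch only: it uses the built-in mtcars data as a stand-in (not the Milan listings), and the three models m1 to m3 are hypothetical examples, not the report's model1 to model8.

```r
# Stand-in data: built-in mtcars, NOT the Milan listings.
# m1, m2 and m3 are hypothetical nested models used only for illustration.
fits <- list(
  m1 = lm(log(mpg) ~ wt, data = mtcars),
  m2 = lm(log(mpg) ~ wt + hp, data = mtcars),
  m3 = lm(log(mpg) ~ wt + hp + factor(cyl), data = mtcars)
)

# Collect adjusted R-squared and residual standard error for each fit.
comparison <- data.frame(
  model         = names(fits),
  adj_r_squared = sapply(fits, function(m) summary(m)$adj.r.squared),
  residual_se   = sapply(fits, function(m) summary(m)$sigma),
  row.names     = NULL
)

# Name of the model with the highest adjusted R-squared.
best_model <- comparison$model[which.max(comparison$adj_r_squared)]

print(comparison)
print(best_model)
```

With the report's own objects, the same pattern would presumably take list(model1 = model1, ..., model8 = model8) in place of the stand-in fits.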

  1. Suppose you are planning to visit the city you have been assigned to over reading week, and you want to stay in an Airbnb. Find Airbnb’s in your destination city that are apartments with a private room, have at least 10 reviews, and an average rating of at least 90. Use your best model to predict the total cost to stay at this Airbnb for 4 nights. Include the appropriate 95% interval with your prediction. Report the point prediction and interval in terms of price_4_nights.
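# Keep private rooms with at least 10 reviews and a rating of at least 4.5
# (4.5 out of 5 corresponds to 90 out of 100)
filtered_dataset <- listings %>%
  filter(number_of_reviews >= 10, review_scores_rating >= 4.5, room_type == "Private room")

# Predict with model 8 and exponentiate the results back to prices.
# CI_lower and CI_upper hold the bounds of the 95% prediction interval.
model_prediction <- 
  data.frame(predict(model8, newdata = filtered_dataset, interval = "prediction")) %>% 
  mutate(price = exp(fit),
         CI_lower = exp(lwr),
         CI_upper = exp(upr)) %>%
  select(-fit, -lwr, -upr)
head(model_prediction)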
| price | CI_lower | CI_upper |
|---|---|---|
| 606 | 212 | 1.73e+03 |
| 215 | 75.5 | 613 |
| 146 | 51.1 | 417 |
| 196 | 68.9 | 560 |
| 207 | 72.7 | 591 |
| 193 | 67.8 | 550 |
ggplot(model_prediction, aes(x = price)) +
  geom_density() +
  labs(title = "Price Distribution of Suitable Airbnb", x = "Pricing") +
  theme(axis.text.y = element_blank())

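The final data frame shows, for each suitable property, the predicted price for 4 nights and the bounds of the 95% prediction interval (columns CI_lower and CI_upper). The predictions come from Model 8, which has an R squared of 0.304. Because the model explains only about 30% of the variation in the response, its residual standard error is large (0.534), and this is what makes the prediction intervals so wide.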
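
To see how the residual standard error translates into interval width, the sketch below approximates the 95% prediction interval by hand. It assumes the model's response is on the log scale (as the exp() back-transformation in the prediction code implies) and ignores uncertainty in the estimated coefficients, so it is an approximation and not the exact calculation that predict() performs. The names interval_factor and approx_interval are illustrative.

```r
# Residual standard error of model 8, taken from the summary table.
residual_se <- 0.534

# Approximate multiplicative half-width of a 95% prediction interval
# on the price scale (assumption: coefficient uncertainty is ignored).
interval_factor <- exp(qnorm(0.975) * residual_se)

# First six point predictions shown in the head() output.
price <- c(606, 215, 146, 196, 207, 193)

# Approximate bounds: divide and multiply the point prediction by the factor.
approx_interval <- data.frame(
  price = price,
  lower = price / interval_factor,
  upper = price * interval_factor
)

print(round(interval_factor, 2))
print(round(approx_interval, 1))
```

Under these assumptions each interval runs from roughly the point prediction divided by 2.85 to the point prediction multiplied by 2.85, which closely matches the CI_lower and CI_upper values reported for the first six properties.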

4 Acknowledgements